Skip to content

fix: avx512vl masked load/store#1353

Open
DiamonDinoia wants to merge 4 commits into
xtensor-stack:masterfrom
DiamonDinoia:fix/avx512vl-masked-memory
Open

fix: avx512vl masked load/store#1353
DiamonDinoia wants to merge 4 commits into
xtensor-stack:masterfrom
DiamonDinoia:fix/avx512vl-masked-memory

Conversation

@DiamonDinoia
Copy link
Copy Markdown
Contributor

@DiamonDinoia DiamonDinoia commented May 21, 2026

xsimd::batch<uint32_t, avx512vl_256>::store(ptr, constexpr_mask, mode) used
to compile to a 6-instruction per-lane scalar extract loop instead of a single
EVEX-encoded masked store, because the call site in xsimd_batch.hpp:766
over-specified template arguments and SFINAE'd away every viable masked-store
overload except the scalar common fallback.

The load side was unaffected — its call site (:743) was already correct.

#include <xsimd/xsimd.hpp>

using A    = xsimd::avx512vl_256;
using Bu32 = xsimd::batch<uint32_t, A>;

// Constant alternating mask: lanes 0,2,4,6 active.
struct alt {
    static constexpr bool get(std::size_t i, std::size_t) { return (i & 1) == 0; }
};

static constexpr auto mask = xsimd::make_batch_bool_constant<uint32_t, alt, A>();

Bu32 load_u32(uint32_t const* p) {
    return Bu32::load(p, mask, xsimd::unaligned_mode{});
}

void store_u32(uint32_t* p, Bu32 v) {
    v.store(p, mask, xsimd::unaligned_mode{});
}
g++ -O3 -S -masm=intel -std=c++14 -march=skylake-avx512 -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256

Codegen — before (master, commit 7d30b9cc)

_Z8load_u32PKj:                  # load_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    valignd       ymm1, ymm0, ymm0, 6
    vmovd         DWORD PTR  [rdi], xmm0
    vpextrd       DWORD PTR 8[rdi], xmm0, 2
    vextracti32x4 xmm2, ymm0, 1
    vmovd         DWORD PTR 24[rdi], xmm1
    vmovd         DWORD PTR 16[rdi], xmm2
    ret

Load is fine. Store is the scalar fallback: GCC unrolled the 8-lane loop in
xsimd_common_memory.hpp:377 into four per-lane 32-bit stores, materialising
each active lane no mask instruction involved.

Codegen — after

_Z8load_u32PKj:                  # load_u32 — unchanged
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rdi]
    ret

_Z9store_u32PjN5xsimd5batchIjNS0_12avx512vl_256EEE:   # store_u32
    mov         eax, 85
    kmovb       k1, eax
    vmovdqu32   YMMWORD PTR [rdi]{k1}, ymm0
    ret

One masked instruction.

Bug fixes

  • AVX-512VL masked store collapsed to a 6-insn scalar fallback. An over-specified template arg list at the public batch::store(mask) call site pushed a type into a non-type pack, SFINAE'ing every per-arch overload away. The store now reaches the EVEX vmov*{k} intrinsic.

  • CI linux.yml typo silently dropped CXXFLAGS in the VL_128 matrix row ($CXX_FLAGS vs $CXXFLAGS), so the VL_128-default test job was building with stock flags instead of the requested override.

  • avx512vl register-traits comment misattributed to AVX512DQ.

  • Half-confined masked op fell back to scalar on AVX/AVX2/AVX-512F. The half-fold hardcoded sse4_2/avx2 as the half-width target, but two-phase lookup made the better-arch overload invisible at template-definition time — so the recursive call dispatched to the wrong arch. Fixed by include reorder + make_sized_batch_t<T, half>.

Features added

  • Per-type load_masked / store_masked for avx512vl_128 and avx512vl_256 covering i32/u32/i64/u64/f32/f64 in both aligned and unaligned modes; partial ordering picks them over the avx2 bridges these archs
    inherit.
  • Compile-time guarantee that default_arch matches XSIMD_DEFAULT_ARCH when the macro is set — a static_assert in test_arch.cpp (plus the CMake plumbing to forward the macro) catches default-arch wiring regressions at compile time instead of at runtime.

Cleanups

  • AVX detail::maskstore helpers now take batch<> types, symmetric with detail::maskload.
  • Half-store sites and the common select use a typed variable (const batch<T, A> lo = …) instead of const auto lo = batch<T, A>{ … }.
  • Aligned/unaligned dispatch in the VL overloads uses XSIMD_IF_CONSTEXPR, so the inactive intrinsic isn't instantiated.
  • Half-fold in xsimd_avx.hpp / xsimd_avx2.hpp / xsimd_avx512f.hpp uses make_sized_batch_t<T, half> instead of a hardcoded arch — picks avx_128 / avx2_128 / avx512vl_256 when available, so half-confined stores land on the EVEX or VEX masked intrinsic.
  • xsimd_isa.hpp include order: _128 siblings before their wider arch, VL before avx512f.hpp. Required so the recursive store_masked<half_arch> call sees the better-arch overload at template-definition time.

(Changelog by @claude)

@DiamonDinoia DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch 5 times, most recently from fe2938e to ea882e6 Compare May 21, 2026 12:04
@DiamonDinoia DiamonDinoia marked this pull request as ready for review May 21, 2026 12:51
if [[ '${{ matrix.sys.flags }}' == 'avx512vl_128' ]]; then
CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
CXXFLAGS="$CXX_FLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
CXXFLAGS="$CXXFLAGS -DXSIMD_DEFAULT_ARCH=avx512vl_128"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oopsie. Thanks for fixing this one.

Copy link
Copy Markdown
Contributor

@serge-sans-paille serge-sans-paille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could keep each architecture file unaware from other architectures. I do understand this will disappear once we move to C++17-based architecture, but I'd be happier if you could find another way to apply the constraints.

// and the VL native as equally specialized for A=avx512vl_*. (bridge_not_vl in fwd.hpp)
template <class A, bool... Values, class Mode>
XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...>, convert<int32_t>, Mode, requires_arch<A>) noexcept
XSIMD_INLINE std::enable_if_t<bridge_not_vl<A>::value, batch<int32_t, A>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should not be any arch-specific code in xsimd_common_memory.hpp. Could you find another approach?

Comment thread include/xsimd/arch/xsimd_avx.hpp Outdated

template <class A>
XSIMD_INLINE void maskstore(double* mem, batch_bool<double, A> const& mask, batch<double, A> const& src) noexcept
XSIMD_INLINE void maskstore(double* mem, batch<as_integer_t<double>, A> const& mask, batch<double, A> const& src) noexcept
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wgy that change? In my mental model, the masked store take a bool mask, not an integer mask

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We _mm256_maskstore_ps takes a __m256i for mask. The bool mask here available in the function calling this utility is backed by a floating point type.

Comment thread include/xsimd/arch/xsimd_avx2.hpp Outdated
// single templated implementation for integer masked loads (32/64-bit)
template <class A, class T, bool... Values, class Mode>
template <class A, class T, bool... Values, class Mode,
class = std::enable_if_t<std::is_base_of<avx2, A>::value && !std::is_base_of<avx512vl_256, A>::value>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite dislike the fact that xsimd_avx2.hpp needs to know stuff about avx512vl

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, I'll have another look. Without this constraint all compilers work fine except gcc-10 :(

Comment thread include/xsimd/arch/xsimd_avx_128.hpp Outdated
}
template <class A, bool... Values, class Mode>
template <class A, bool... Values, class Mode,
class = std::enable_if_t<std::is_base_of<avx_128, A>::value && !std::is_base_of<avx512vl_128, A>::value>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same architecture mix reference issue here.

#if XSIMD_WITH_AVX
#include "./xsimd_avx.hpp"
// clang-format off
// _128 first: avx half-fold recursive call needs avx_128 visible at parse time.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch.

* @ingroup architectures
*
* AVX512DQ instructions
* AVX512VL instructions
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oopsie

template <typename T, std::size_t N>
struct make_sized_batch;
template <typename T, std::size_t N>
using make_sized_batch_t = typename make_sized_batch<T, N>::type;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Comment thread test/CMakeLists.txt Outdated
message(STATUS "Using emulated target: ${TARGET_EMULATED}")
set(EMULATED_COMPILE_FLAGS -DXSIMD_DEFAULT_ARCH=${TARGET_ARCH};-DXSIMD_WITH_EMULATED=1)
unset(TARGET_ARCH CACHE)
elseif (DEFINED XSIMD_DEFAULT_ARCH AND NOT "${XSIMD_DEFAULT_ARCH}" STREQUAL "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://cmake.org/cmake/help/latest/command/if.html#constant I think

if(XSIMD_DEFAULT_ARCH)

is enough

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread test/test_arch.cpp
static_assert(xsimd::all_architectures::contains<xsimd::default_arch>(), "default arch is a valid arch");
#else
namespace xsimd
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay

@AntoinePrv
Copy link
Copy Markdown
Contributor

@DiamonDinoia I run into similar issues of load_masked in #1348 where I also fixed and improved the linux.yml CI.

I haven't look at assembly (just making it build with the new updated CI settings). The C++ diff should be fairly small, I wonder if you'd be able to get anything useful from it. I was able to simplify much more the functions from common_memory while keeping them lower priority (I believe). However I think the case common memory casting to float and calling again into the original arch is a big footgun. Would be better IMHO to explicitly make that call everywhere (with possibly a second / different utility function).

That being said, my solution currently stalls on avx512vl_128/avx512vl_256...

Let CMake force a specific default arch via -DXSIMD_DEFAULT_ARCH
(idiomatic if(XSIMD_DEFAULT_ARCH) guard), add a test_arch.cpp check
that the forced arch is the default, and fix the linux.yml CXXFLAGS typo.
Split the avx_128 variable swizzle into explicit float/double overloads
with a width static_assert, and fix an AVX512DQ -> AVX512VL doc comment.
Add the missing int64/uint64/float/double load_masked overloads and
correct the store_masked batch_bool_constant typing on avx512vl_128 and
avx512vl_256, branching aligned vs unaligned to the right EVEX intrinsic
(vmovdqu{32,64}{k}{z} / vmov{a,u}p{s,d}{k}{z}); unsigned overloads
delegate via bitwise_cast. Resolve the avx/avx2/avx512f half-fold target
through make_sized_batch_t<T, half>::arch_type so a 512-bit masked op
picks the VL arch and emits EVEX instead of VEX vpmaskmov*/vmaskmov*.
@DiamonDinoia DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from ea882e6 to fa06792 Compare June 1, 2026 15:56
Drop the cross-arch SFINAE/tag mechanism: a concrete
requires_arch<avx512vl_128|256> overload now beats the inherited
avx2/avx2_128 one by overload conversion ranking, so no arch file knows
about another. xsimd_common_memory.hpp keeps only requires_arch<common>
and dispatches on the arch-agnostic trait masked_memory_uses_fp_bitcast
(integral with a same-width float register -> reuse that float vmaskmov*
path, else a scalar buffer). avx/avx2/avx2_128 drop every
is_base_of<avx512vl_*, A> guard; avx2_128 routes native 128-bit integer
masked memory through vpmaskmov* (long long* cast for 64-bit) and tags
int64/uint64 on avx2_128 (those intrinsics need AVX2). detail::maskstore
takes a bool mask and casts internally; xsimd_batch.hpp keeps a
make_sized_batch fwd-decl and simplifies the store_masked call;
xsimd_isa.hpp documents the _128-first include order; sse2.hpp adapts to
the new store_masked(common) signature.
@DiamonDinoia DiamonDinoia force-pushed the fix/avx512vl-masked-memory branch from fa06792 to 5a40538 Compare June 1, 2026 16:53
@DiamonDinoia
Copy link
Copy Markdown
Contributor Author

@serge-sans-paille ready for a second round of review. Probably over commented because I chatted with @claude how to best do this in c++14 (No if constexpr...) my solutions where so SFINAE heavy so I asked multiple times how to simplify this and the final outcome is this one. I left the comments in to kind of explain a bit more, happy to trim/clean them after a second round of review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants